Introduction to Linear Regression

Md Zulquar Nain

Linear Regression

  • The linear regression method is one of the most widely used methods for examining the linear relationship between the dependent and independent variable(s).
  • There is one dependent variable, usually represented by \(Y\).
  • There may be one or more independent variables, usually represented by \(X\)s.

The problem

  • We want to examine:
  • Does life expectancy depend on the income level?
  • To measure the income level, we will use GDP per capita.
  • Data obtained from https://data.worldbank.org/country/india

Importing the Data File

# importing data from `csv` file
datar <- read.csv("sdata.csv")
  • datar - the name of the imported data frame in R

  • sdata.csv - the name of the csv file being imported

Exploring the Dataset I

  • Class, structure and dimension of the dataset
# Structure of the data
str(datar)
'data.frame':   62 obs. of  3 variables:
 $ Year : int  1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
 $ LE   : num  45.2 45.4 45.7 45.9 46.2 ...
 $ GDPPC: num  165 168 168 175 183 ...
# Class of the data
class(datar)
[1] "data.frame"
# Dimension of the data
dim(datar)
[1] 62  3

Exploring the Dataset II

  • First n rows of observations of the data set:
    • head(dataframe, n)
  • Last n rows of observations of the data set:
    • tail(dataframe, n)
# View top two rows of the data
head(datar,2)
  Year     LE    GDPPC
1 1960 45.218 165.2733
2 1961 45.398 167.5203
# View bottom two rows 
tail(datar,2)
   Year    LE     GDPPC
61 2020 70.15  980.1808
62 2021 67.24 1060.4024

The Formula

  • The mathematical formula of the linear regression can be written as follows:

\[y = \beta_0 + \beta_1*x + u\]

  • We say \(y\) depends on \(x\), and read this as: \(y\) is equal to \(\beta_1\) times \(x\), plus a constant \(\beta_0\), plus an error term \(u\).

  • When you have multiple independent variables, the equation can be written as \(y = \beta_0 + \beta_1\times x_1 + \beta_2\times x_2 + ... + \beta_n\times x_n\), where:

  • \(\beta_0\) is the intercept,

  • \(\beta_1, \beta_2, \cdots,\beta_n\) are the regression or slope coefficients associated with the predictors \(x_1, x_2, \cdots, x_n\).

  • \(u\) is the error term, the part of \(y\) that cannot be explained by the regression model.
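
For the problem at hand, with life expectancy (LE) as the dependent variable and GDP per capita (GDPPC) as the predictor, the simple model takes the form:

\[LE = \beta_0 + \beta_1 \times GDPPC + u\]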

Visualization of Data

  • Before estimating a simple linear regression model, visualize the data to gain an understanding of the relationship.
  • Make use of a scatterplot (a sketch follows below).
  • The resulting scatterplot shows that there is a positive relationship between life expectancy and the income level.
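
A minimal sketch of such a scatterplot in base R, using the datar data frame imported earlier (the axis labels and title are illustrative choices):

# Scatterplot of life expectancy against GDP per capita
plot(datar$GDPPC, datar$LE,
     xlab = "GDP per capita",
     ylab = "Life expectancy (years)",
     main = "Life expectancy vs. income level")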

Simple Linear Regression Model: The estimation

  • Linear regression in R can be estimated using the lm function.
  • The lm command takes the variables in the format:
  • lm([target/dependent var] ~ [predictor/independent var], data = [data source])
  • To know more, use help(lm).

Estimation

rlm <- lm(formula = LE ~ GDPPC,
          data = datar)
summary(rlm)

Call:
lm(formula = LE ~ GDPPC, data = datar)

Residuals:
   Min     1Q Median     3Q    Max 
-8.627 -3.498  1.082  3.559  4.791 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.26149    0.93112   50.76   <2e-16 ***
GDPPC        0.02698    0.00193   13.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.02 on 60 degrees of freedom
Multiple R-squared:  0.7651,    Adjusted R-squared:  0.7612 
F-statistic: 195.4 on 1 and 60 DF,  p-value: < 2.2e-16

The Output

  • The summary output shows six components, including:

  • Call: Shows the function call used to compute the regression model.

  • Residuals: Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.

  • Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.

  • Residual standard error (RSE), R-squared \((R^2)\), and the F-statistic are metrics used to check how well the model fits our data.
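
As a small sketch of how these components can be accessed programmatically from the fitted rlm object:

# List the components stored in the summary object
names(summary(rlm))
# Access individual pieces, e.g. the coefficient table and the RSE
summary(rlm)$coefficients
summary(rlm)$sigma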

The output: Interpretation

  • The first step in interpreting the simple/multiple regression analysis is to examine the F-statistic and the associated p-value at the bottom of the model summary.

  • In our example, it can be seen that the p-value of the F-statistic is less than 2.2e-16, which is highly significant. This means that the predictor variable is significantly related to the outcome variable (a sketch of extracting this p-value follows below).

  • Next comes the significance of the individual coefficients.
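
A minimal sketch, assuming the rlm object fitted above: summary() stores the F-statistic itself but not its p-value, which can be recomputed from the F distribution.

# F-statistic: value, numerator df, denominator df
f <- summary(rlm)$fstatistic
# p-value of the overall F test
pf(f[1], f[2], f[3], lower.tail = FALSE)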

The output: Interpretation

summary(rlm)$coeff
               Estimate  Std. Error  t value     Pr(>|t|)
(Intercept) 47.26148728 0.931115969 50.75790 5.373627e-51
GDPPC        0.02697652 0.001929822 13.97876 1.565652e-20
  • For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

  • It can be seen that changes in the income level are significantly associated with changes in life expectancy in India.

  • For a given predictor variable, the coefficient \(\beta\) can be interpreted as the average effect on \(y\) of a one-unit increase in the predictor \(x\).

  • In our example, as income (GDP per capita) increases by 100 units, life expectancy increases by about 2.7 years (100 × 0.027); a quick check of this arithmetic follows below.
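
A one-line check of this arithmetic from the fitted model, using the coef() accessor:

# Effect on life expectancy of a 100-unit increase in GDP per capita
coef(rlm)["GDPPC"] * 100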

Model accuracy

  • The next step is to check how good the model is, that is, how well it explains the data.

  • The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary (see the sketch after this list):

  • Residual Standard Error (RSE),

  • R-squared \((R^2)\) and \(adjusted~R^2\),

  • F-statistic
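
A minimal sketch of extracting these three quantities from the model summary (assuming the rlm object fitted earlier):

# Residual standard error, R-squared, adjusted R-squared, F-statistic
summary(rlm)$sigma
summary(rlm)$r.squared
summary(rlm)$adj.r.squared
summary(rlm)$fstatistic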

Model accuracy

  • Residual Standard Error (RSE)
  • The RSE (or model sigma), corresponding to the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model.
  • The lower the RSE, the better the model fits our data.
  • Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible (a sketch follows below).
  • In this example, the RSE = 4.02, meaning that the observed values deviate from the predicted values by approximately 4.02 units on average.
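
A sketch of the prediction error rate for this model, assuming LE in datar is the outcome variable:

# Prediction error rate: RSE divided by the mean of the outcome
sigma(rlm) / mean(datar$LE)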

Model accuracy

  • R-squared \((R^2)\) and adjusted \(R^2\)
  • R-squared \((R^2)\) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be explained by the model's predictor variables.
  • \(R^2\) measures how well the model fits the data: the higher the \(R^2\), the better the model.
  • However, a problem with \(R^2\) is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome.
  • A solution is to adjust \(R^2\) by taking into account the number of predictor variables.
  • The adjusted R-squared is therefore the better measure.
  • An (adjusted) \(R^2\) that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
  • In this example, the adjusted \(R^2\) is 0.7612, which is good. A sketch of computing \(R^2\) by hand follows below.
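
As an illustrative sketch (not shown in the slides), \(R^2\) can also be computed by hand from the residual and total sums of squares:

# R-squared by hand: 1 - RSS/TSS
rss <- sum(residuals(rlm)^2)
tss <- sum((datar$LE - mean(datar$LE))^2)
1 - rss / tss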

THANKS